The Network Nullspace Property for Compressed Sensing of Big Data over Networks
We present a novel condition, which we term the network nullspace property,
that ensures accurate recovery of graph signals representing massive
network-structured datasets from few signal values. The network nullspace
property couples the cluster structure of the underlying network with
the geometry of the sampling set. Our results can be used to design efficient
sampling strategies based on the network topology.
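The abstract does not spell out a recovery algorithm, but the intuition behind coupling cluster structure with the sampling set can be sketched as follows (a hypothetical illustration, not the paper's method): if a graph signal is approximately constant over clusters and every cluster contains at least one sampled node, each unobserved node can be filled in from its cluster's samples.

```python
# Hypothetical sketch: recover a cluster-piecewise-constant graph signal
# from a few sampled nodes, assuming every cluster contains >= 1 sample.
# Illustrates coupling cluster structure with the sampling set; this is
# NOT the paper's recovery algorithm.

def recover_signal(clusters, samples):
    """clusters: dict cluster_id -> list of node ids
    samples:  dict node_id -> observed signal value
    Returns a dict node_id -> recovered value."""
    recovered = {}
    for cid, nodes in clusters.items():
        observed = [samples[n] for n in nodes if n in samples]
        if not observed:
            raise ValueError(f"cluster {cid} contains no sampled node")
        estimate = sum(observed) / len(observed)  # average within cluster
        for n in nodes:
            recovered[n] = estimate
    return recovered

# Two clusters, one sample each: the full signal is reconstructed.
clusters = {0: [0, 1, 2], 1: [3, 4]}
samples = {1: 5.0, 4: -2.0}
print(recover_signal(clusters, samples))
# {0: 5.0, 1: 5.0, 2: 5.0, 3: -2.0, 4: -2.0}
```

If some cluster contained no sampled node, its values would be unrecoverable, which is exactly the kind of failure a topology-aware sampling strategy is meant to rule out.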
GitTables: A Large-Scale Corpus of Relational Tables
The success of deep learning has sparked interest in improving relational
table tasks, like data preparation and search, with table representation models
trained on large table corpora. Existing table corpora primarily contain tables
extracted from HTML pages, limiting the capability to represent offline
database tables. To train and evaluate high-capacity models for applications
beyond the Web, we need resources with tables that resemble relational database
tables. Here we introduce GitTables, a corpus of 1M relational tables extracted
from GitHub. Our ongoing curation aims to grow the corpus to at least 10M
tables. Analyses of GitTables show that its structure, content, and topical
coverage differ significantly from existing table corpora. We annotate table
columns in GitTables with semantic types, hierarchical relations, and
descriptions from Schema.org and DBpedia. The evaluation of our annotation
pipeline on the T2Dv2 benchmark illustrates that our approach provides results
on par with human annotations. We present three applications of GitTables,
demonstrating its value for learned semantic type detection models, schema
completion methods, and benchmarks for table-to-KG matching, data search, and
preparation. We make the corpus and code available at
https://gittables.github.io
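To make the annotation idea concrete, here is a minimal, hypothetical sketch of header-based column annotation in the spirit of the GitTables pipeline: normalized column headers are looked up in a semantic-type vocabulary. The toy dictionary below is an assumption for illustration; the real pipeline matches against Schema.org and DBpedia and is considerably more sophisticated.

```python
# Hypothetical sketch of header-based column annotation: map normalized
# column headers to semantic types from a small, made-up vocabulary.
# The real GitTables pipeline annotates against Schema.org and DBpedia.

import re

TYPE_DICTIONARY = {          # toy vocabulary, not the real ontology
    "name": "schema:name",
    "birth date": "schema:birthDate",
    "email": "schema:email",
}

def normalize(header):
    """Lowercase and split camelCase / snake_case headers into words."""
    header = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", header)
    return " ".join(header.replace("_", " ").lower().split())

def annotate(headers):
    """Return {header: semantic type or None} for each column header."""
    return {h: TYPE_DICTIONARY.get(normalize(h)) for h in headers}

print(annotate(["BirthDate", "e-mail", "Name"]))
# {'BirthDate': 'schema:birthDate', 'e-mail': None, 'Name': 'schema:name'}
```

Headers that fail the exact lookup (like "e-mail" above) are where fuzzier, embedding-based matching would take over.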
Sherlock: A Deep Learning Approach to Semantic Data Type Detection
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches are often not robust to dirty data and detect only a limited number of types. We introduce Sherlock, a multi-input deep neural network for detecting semantic types. We train Sherlock on 686,765 data columns retrieved from the VizNet corpus by matching 78 semantic types from DBpedia to column headers. We characterize each matched column with 1,588 features describing the statistical properties, character distributions, word embeddings, and paragraph vectors of column values. Sherlock achieves a support-weighted F1 score of 0.89, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations.
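The kinds of per-column features the abstract mentions (statistical properties and character distributions) can be sketched in a few lines. This is a hypothetical miniature, not Sherlock's actual feature extractor, which computes 1,588 features including word and paragraph embeddings.

```python
# Hypothetical sketch of per-column features in the spirit of Sherlock:
# statistical properties of value lengths and a character-distribution
# feature. The real model uses 1,588 features, including embeddings.

import math
from collections import Counter

def column_features(values):
    strings = [str(v) for v in values]
    lengths = [len(s) for s in strings]
    n = len(strings)
    # Statistical properties
    mean_len = sum(lengths) / n
    frac_numeric = sum(s.replace(".", "", 1).isdigit() for s in strings) / n
    # Character-distribution feature: entropy over all characters
    counts = Counter("".join(strings))
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {"n_values": n, "mean_len": mean_len,
            "frac_numeric": frac_numeric, "char_entropy": entropy}

feats = column_features(["1990", "1984", "2003"])
print(feats["frac_numeric"])  # 1.0 — every value parses as a number
```

A vector like this, computed per column, is what a multi-input classifier would consume alongside the embedding features.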
VizNet: Towards A Large-Scale Visualization Learning and Benchmarking Repository
Researchers currently rely on ad hoc datasets to train automated visualization tools and evaluate the effectiveness of visualization designs. These exemplars often lack the characteristics of real-world datasets, and their one-off nature makes it difficult to compare different techniques. In this paper, we present VizNet: a large-scale corpus of over 31 million datasets compiled from open data repositories and online visualization galleries. On average, these datasets comprise 17 records over 3 dimensions, and across the corpus we find that 51% of the dimensions record categorical data, 44% quantitative, and only 5% temporal. VizNet provides the necessary common baseline for comparing visualization design techniques and for developing benchmark models and algorithms for automating visual analysis. To demonstrate VizNet's utility as a platform for conducting online crowdsourced experiments at scale, we replicate a prior study assessing the influence of user task and data distribution on visual encoding effectiveness, and extend it by considering an additional task: outlier detection. To contend with running such studies at scale, we demonstrate how a metric of perceptual effectiveness can be learned from experimental results, and show its predictive power across test datasets.
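The dimension-type breakdown reported above (51% categorical, 44% quantitative, 5% temporal) implies tagging each column of each dataset with one of three types. A minimal, hypothetical heuristic for that tagging and tallying might look like the following; the corpus's actual type detection is more involved.

```python
# Hypothetical sketch of dimension typing behind corpus statistics like
# VizNet's: tag each column as temporal, quantitative, or categorical,
# then tally the breakdown. The real pipeline is more sophisticated.

from datetime import datetime

def dimension_type(values):
    def is_number(v):
        try:
            float(v)
            return True
        except ValueError:
            return False
    def is_date(v):
        try:
            datetime.strptime(v, "%Y-%m-%d")  # assumed toy date format
            return True
        except ValueError:
            return False
    if all(is_date(v) for v in values):
        return "temporal"
    if all(is_number(v) for v in values):
        return "quantitative"
    return "categorical"

def breakdown(columns):
    """Fraction of columns per dimension type."""
    types = [dimension_type(col) for col in columns]
    return {t: types.count(t) / len(types) for t in set(types)}

cols = [["red", "blue"], ["1.5", "2.0"], ["2019-01-01", "2019-02-01"]]
print(breakdown(cols))
```

Run over 31 million datasets, a tally like this yields exactly the kind of corpus-level percentages quoted in the abstract.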